-
Notifications
You must be signed in to change notification settings - Fork 150
Guest tracing improvements to use tracing
crate
#844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
tracing
crate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a high level review, and what I've seen looks good :-)
d4327a8
to
ccaa14e
Compare
ccaa14e
to
cceb069
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round review looks good to me. I am curious what the performance looks like with tracing vs without, maybe we could add a benchmark or something for this?
Also, is there any possibility that we don't flush spans/records after exiting the guest, and that some end up not being emitted?
Another thing to consider is log crate vs tracing crate. Should we ditch one? Or is there some mechanism that allows regular logs to be consumed by tracing crate: And should we expose any of these in guest C-api?
Also is the tracing buffer sizes configurable? Maybe it should be if it isn't, so users can tweak it in case it affects performance.
+1 |
Ok, I can do that.
Hmm, in my limited testing I haven't seen this case, but I wouldn't exclude the possibility.
I am not sure about the best approach is.
The tracing buffer is compile time configurable, which I agree is not ideal for customers. |
77bbba5
to
6d10d2e
Compare
6d10d2e
to
8ff5a1e
Compare
I've run some benchmarks locally and here are the results.
Runtimes:
Some thoughts:
|
8ff5a1e
to
f00d63e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me in general, but I think maybe we should consider what tunables we need
); | ||
// It is important that all the tracing events are produced after the tracing is initialized. | ||
#[cfg(feature = "trace_guest")] | ||
if max_log_level == LevelFilter::Trace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why only init if trace
level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to avoid the performance penalty if it is not really needed.
This provides a way to selectively enable it at runtime for development purposes.
I wouldn't say it is yet ready for production, but it definitely is a step in the right direction, starting from what we had (something custom that needs maintenance), to something that uses the tracing
and opentelemetry
crates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if I wanted to turn on tracing for LevelFilter::DEBUG or LevelFilter::INFO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be != LevelFilter::Off
https://docs.rs/tracing/latest/tracing/level_filters/struct.LevelFilter.html#impl-LevelFilter
This custom subscriber stores the spans and events in a buffer initialized only when tracing is enabled. For each new span and event, a method is called on the custom subscriber which | ||
not only stores the data, but also keeps track of the hierarchy and dependencies between the other spans/events. | ||
|
||
When the storage space is filled, the guest triggers a VM Exit that sends the guest pointers to the host. The host can access the guest memory, get the data and parse it to create the `spans` and `events` using the `opentelemetry` crate which allows specifying the starting and ending timestamps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe an option could be to only emit logs during regular VMexits in order to minimize context switches. Or maybe another option could be to discard logs when the buffer is full in order to not incur extra context switches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is another option.
I prioritized not losing any info, but we might not care about that. A ringbuffer would do the trick here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. If this could be configurable that would be ideal i think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be a nice feature, but at this moment, the logic on the host relies on the spans not being lost so that it knows how to arrange
them correctly.
Each span contains info about it's parent span, if one of those is lost, then there's a gap.
A ringbuffer would work well with logs, but traces, I am not sure.
f00d63e
to
148470a
Compare
- This feature is not used separate from the mem_profile - All the unwind logic is now gated by mem_profile Signed-off-by: Doru Blânzeanu <[email protected]>
- The guest side does not use this type of OutBAction - The stack unwinding is done either way when the mem_profile feature is enabled Signed-off-by: Doru Blânzeanu <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
- This helps with keeping code separate and easily gating it out Signed-off-by: Doru Blânzeanu <[email protected]>
- This steps cleans up codebase for the new way of tracing guests - The current method involves custom macros and logic that are not the best for maintainability Signed-off-by: Doru Blânzeanu <[email protected]>
- Define a separate struct that holds the functionality related to memory profiling of the guest Signed-off-by: Doru Blânzeanu <[email protected]>
- Rename TraceInfo to reflect only being used by mem_profile Signed-off-by: Doru Blânzeanu <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
- Adds a type that implements the Subscriber trait of the tracing_core crate that allows the type to be set as the global Subscriber of the crate - This way we can handle the adding of new spans and events and store them where/how we want Signed-off-by: Doru Blânzeanu <[email protected]>
- implement add_span and event methods that store the info and report it to the host when the buffer gets full Signed-off-by: Doru Blânzeanu <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
- Parse the spans and events coming from the guest and create corresponding spans and events from the host that mimics a single call from host - Create a `TraceContext` that handles a call into a guest Signed-off-by: Doru Blânzeanu <[email protected]>
148470a
to
908f482
Compare
- conditionally handle logs either through tracing or the dedicated VM exit based on whether tracing is initialized on the guest - modify `test_with_small_stack_and_heap` to 18kB because the `#[intrument]` attributes use more stack. Signed-off-by: Doru Blânzeanu <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
Signed-off-by: Doru Blânzeanu <[email protected]>
908f482
to
6468526
Compare
I see this version using opentelemetry as an incremental step, after which additional tunables can be added to improve usability. Improvements:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like the idea of using the industry standard solution for tracing!
); | ||
// It is important that all the tracing events are produced after the tracing is initialized. | ||
#[cfg(feature = "trace_guest")] | ||
if max_log_level == LevelFilter::Trace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if I wanted to turn on tracing for LevelFilter::DEBUG or LevelFilter::INFO
); | ||
// It is important that all the tracing events are produced after the tracing is initialized. | ||
#[cfg(feature = "trace_guest")] | ||
if max_log_level == LevelFilter::Trace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be != LevelFilter::Off
https://docs.rs/tracing/latest/tracing/level_filters/struct.LevelFilter.html#impl-LevelFilter
let mut k = heapless::String::<FK>::new(); | ||
let mut val = heapless::String::<FV>::new(); | ||
// Shorten key and value if they are bigger than the space allocated | ||
let _ = k.push_str(&f.name()[..usize::min(f.name().len(), k.capacity())]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a chance the key gets reduced to something that would overwrite and corrupt data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how, it is trimmed to the minimum between its length and the capacity of the buffer, so it shouldn't go beyond the capacity of the container.
let mut cfg = SandboxConfiguration::default(); | ||
cfg.set_heap_size(20 * 1024); | ||
cfg.set_stack_size(16 * 1024); | ||
cfg.set_stack_size(18 * 1024); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this for the extra space needed with traces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I discovered lately that this test failed after I rebased to latest main. After some investigation, it happens because of the traces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a couple of comments , I don't think that any of these are things that need to be addressed here but we should consider as future enhancements
let max_log_level = LevelFilter::iter() | ||
.nth(max_log_level as usize) | ||
.expect("Invalid log level"); | ||
init_logger(max_log_level); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only call init_logger
if either #[not(cfg(feature = "trace_guest"))]
or if this is set as an else to if max_log_level != LevelFilter::Off
?
I don't think we should log if tracing is on (we should redirect and log::<level>!
macros to tracing (will this just happen by default?). Conversely if trace_guest
is off we should capture any trace::<level>!
macros as logs (see https://docs.rs/tracing/latest/tracing/#emitting-log-records).
Maybe the way we should deal with this is to always emit traces from the guest (and get rid of the logging functionality in the guest (which makes the comment above redundant). Likewise in the host we could always use tracing, that way a consumer would have full control over if they saw logs or traces?
This will build the guest binaries with the `trace_guest` feature enabled and move them to the appropriate location for use by the host. | ||
This builds the guest binaries with the `trace_guest` feature enabled and move them to the appropriate location for use by the host. | ||
|
||
**NOTE**: To enable the tracing in your application you need to use the `trace_guest` feature on the `hyperlight-guest-bin` and `hyperlight-guest` crates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is strictly true? I can trace the host without this feature?.
Ideally it would be nice to get rid of this feature entirely and have it the functionality enabled by the presence of a subscriber in the host but that is maybe something we can think about in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a way to capture, store and communicate to the host the captured traces.
By having this custom Subscriber, we have that logic defined by us, so we know the format to expect from the guest.
If we find a way to address this, then we don't need that feature
}; | ||
|
||
/// Maximum number of spans that the guest can store | ||
const MAX_NO_OF_SPANS: usize = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering if rather than having fixed sizes for these we could just allocate on the guest heap and then have them processed each time we have a natural VM Exit, probably not an issue for this PR, but if we were to unify all the guest logging into traces (see my other comment about this) then this would probably be a better approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this, I would like to expose a way to configure the space used.
So the guest can say what memory it can spare
Description
This PR closes #723, #704 and partially addresses #318.
These changes modify the way we perform guest tracing to use the
tracing
crate and its macros (instrument
, trace).How it works
Guest
What makes this possible is the implementation of the
Subscriber
trait in thehyperlight-guest-tracing
crate. By implementing it, we can now handle the capturing of spans and events and choose how to store them and when to export them to the host.The
GuestSubscriber
type that implementsSubscriber
keeps an internalTraceState
that holds all the needed information.Whenever a new span is created, entered or exited, a callback on the subscriber is called so that we can handle the functionality. The same happens with the events also.
Each time a new span or event is added to the internal state, we check whether the buffer got full and send them to the host to process.
Host
When the host detects a VM exit from the guest, it checks whether it contains tracing information in the
OutB
instruction.When tracing information is found, the host starts going through it and check against the local storage of spans.
The spans parents are set based on the information got from the host.
TODO
These do not correctly set the parents of the spans created in the host to the last one created in the guest before doing the VM exit
I need to find a way to propagate the context into the guest and back whenever it is needed. But using the Opentelemetry propagators needs
std
support which we do not have in the guest.Jaeger picture of a Guest call that calls back into the host